Phonological Distance Measures
Abstract
Phonological distance can be measured computationally using formally specified algorithms. This work investigates two such measures: one developed by Nerbonne and Heeringa (1997), based on Levenshtein distance (Levenshtein, 1965), and the other an adaptation of Dunning’s (1994) language classifier that uses maximum likelihood distance. These two measures are compared against naïve transcriptions of the speech of pediatric cochlear implant users. The new measure, maximum likelihood distance, correlates highly with Levenshtein distance and with naïve transcriptions. Results from this corpus are easier to obtain because cochlear implant speech is less intelligible than the speech of a different dialect, which is usually highly intelligible.

Measuring linguistic distance has proceeded in a number of ways over the last two centuries. Early methods were applicable only to a specific area; in dialectology, for example, Chambers and Trudgill (1998) describe drawing dialect boundaries using cognate sets. By the mid-twentieth century, computers were influential enough that they were used to implement some numerical measures, as in the groundbreaking work of Séguy (1973), which began in the 1950s.

The present work continues the recent line of investigation using Levenshtein distance that was begun by Kessler (1995) and is best exemplified by Nerbonne and Heeringa (1997). It also looks to statistical, probabilistic methods that require even less linguistic knowledge and allow even more general applicability. For example, the Levenshtein distance measure used by Nerbonne and Heeringa (1997) allows comparison of any two corpora that transcribe the same word list. The probabilistic method developed here is based on a maximum likelihood estimator language classifier explained by Dunning (1994).
It allows an estimator trained on an arbitrary input to classify any other corpus, and it can do this with less linguistic knowledge built into the algorithm. Both of these measures produce a single scalar, which we call “phonological distance.” Phonological distance can be measured between any two corpora of phonetic data, so, for example, the distance between the speech of implant users and adult American English could be compared to the distance between the speech of second language learners of English and adult American English. This wider applicability is a result of reducing the number of linguistic assumptions.

In this paper, we measure phonological distance between the speech of pediatric cochlear implant users and that of adult American English speakers who have normal hearing. The resulting distances are easier to compare against a human baseline than those in previous work because of the relatively low intelligibility of cochlear implant users’ speech to naïve listeners. Previous work such as Gooskens and Heeringa (2004) and Heeringa (2004) has had to work around a possible ceiling effect caused by the high mutual intelligibility of national dialects.

In addition, we hope to provide another way to measure progress in cochlear implant user development. These two algorithms each produce a single scalar, similar to existing measures such as the Goldman-Fristoe Test of Articulation (Goldman & Fristoe, 1986). However, these measures produce their results by a fixed algorithm, without any biases or intuitions provided by human calculation except those encoded into the algorithm, where they can be examined. Yet the results still correlate well with human perception of intelligibility. Of course, for these algorithms to be usable in the same way as the Goldman-Fristoe Test of Articulation, they would have to be extensively normed over multiple groups.
Dialectology measures the variation of language over an area or a span of time (Chambers & Trudgill, 1998). Its quantitative application is known as dialectometry, which began in earnest with the groundbreaking work of Séguy (1973) in determining dialect distances in the French region of Gascony. Indeed, the idea of calling this measure “phonological distance” comes from the different areas of language that Séguy combined to form his overall distance. Since then, dialectometry has continued to evolve towards methods that minimize the linguistic knowledge required as input. Séguy’s own work was completed after the widespread availability of computers; although the specific phonological characteristics were hand-picked, the method for combining differences to determine distance was mathematically specified. More recently, Nerbonne and Heeringa (1997) used Levenshtein distance, for which the linguistic knowledge necessary is limited to the specification of phonological data in terms of phones made of feature bundles, and the assurance that the two different corpora represent the same underlying forms.

Levenshtein distance is used in many fields, wherever a measure of similarity between sequences is needed. In bioinformatics, for example, Levenshtein distance is useful for finding similarity between sequences of DNA (Sankoff & Kruskal, 1983). Its wide applicability lies in the fact that it needs only a specification of costs between individual items of the sequence. The algorithm specifies how to combine these costs to find the lowest total distance. Heeringa (2004) gives two specifications for these costs. The earlier proposal, implemented in this paper, specifies the costs phonologically, as the number of features changed. The more recent proposal uses phonetic correlates (that is, F1, F2 and F3 measured in Barks) to determine the distance between two segments.
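As a rough illustration of the feature-count cost scheme described above, the following Python sketch computes a Levenshtein distance in which substituting one phone for another costs the number of features that differ. The feature table is a hypothetical toy inventory, not the feature system used in the paper, and the insertion/deletion cost (the number of features in the phone) is an assumption made only so that the example is complete.

```python
from typing import Dict, FrozenSet, Sequence

# Hypothetical toy feature bundles, one per phone (illustration only).
FEATURES: Dict[str, FrozenSet[str]] = {
    "p": frozenset({"labial", "stop"}),
    "b": frozenset({"labial", "stop", "voiced"}),
    "t": frozenset({"coronal", "stop"}),
    "d": frozenset({"coronal", "stop", "voiced"}),
    "a": frozenset({"vowel", "low"}),
    "i": frozenset({"vowel", "high"}),
}


def substitution_cost(a: str, b: str) -> int:
    """Number of features that differ between two phones."""
    return len(FEATURES[a] ^ FEATURES[b])


def indel_cost(a: str) -> int:
    """Assumed cost of inserting or deleting a phone: its feature count."""
    return len(FEATURES[a])


def levenshtein(source: Sequence[str], target: Sequence[str]) -> int:
    """Lowest total cost of turning `source` into `target`."""
    # prev[j] holds the cost of building target[:j] from an empty prefix.
    prev = [0]
    for t in target:
        prev.append(prev[-1] + indel_cost(t))
    for s in source:
        curr = [prev[0] + indel_cost(s)]
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + indel_cost(s),              # delete s
                curr[j - 1] + indel_cost(t),          # insert t
                prev[j - 1] + substitution_cost(s, t),  # substitute s -> t
            ))
        prev = curr
    return prev[-1]
```

With this toy table, `levenshtein("pat", "bat")` is 1, because the two words differ only in the voicing of the first phone. A distance between two corpora could then be obtained by combining the per-word distances over a shared word list; how to combine them (for example, by summing or averaging) is left open here.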
Dunning’s (1994) work on probabilistic language classification provides a starting point for a probabilistic distance measure. Dunning uses a maximum likelihood classifier trained on an n-gram Markov model of language. This produces an estimated likelihood that the training corpus generated the test corpus. He then classifies the language of each test corpus as the language of the closest training corpus. This can be viewed as a distance measure by retaining the numerical result and reversing the question asked. Instead of training multiple models, we train only one, on the corpus designated as the target language. The likelihood of each test corpus under that model can then be seen as a distance. The reason to prefer such opaque measures is that they obscure, and thus minimize the need for, the knowledge required to obtain the result. This is important to allow the algorithm to be implemented on a computer. We tested both algorithms and compared the results to human judgments of intelligibility.
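The idea of training a single n-gram model on the target language and reading the likelihood of a test corpus as a distance can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it assumes a bigram (n = 2) model over single-character phones, add-one smoothing, a hypothetical word-boundary symbol `#`, and the average negative log-likelihood per transition as the distance.

```python
import math
from collections import Counter
from typing import Iterable

BOUNDARY = "#"  # assumed word-boundary symbol, not from the paper


class BigramModel:
    """Bigram Markov model of phone sequences with add-one smoothing."""

    def __init__(self, words: Iterable[str], alphabet: Iterable[str]) -> None:
        self.symbols = set(alphabet) | {BOUNDARY}
        self.pair_counts: Counter = Counter()
        self.context_counts: Counter = Counter()
        for word in words:
            padded = [BOUNDARY, *word, BOUNDARY]
            for prev, nxt in zip(padded, padded[1:]):
                self.pair_counts[(prev, nxt)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, prev: str, nxt: str) -> float:
        """Smoothed log P(nxt | prev); unseen pairs get a small probability."""
        numerator = self.pair_counts[(prev, nxt)] + 1
        denominator = self.context_counts[prev] + len(self.symbols)
        return math.log(numerator / denominator)

    def distance(self, words: Iterable[str]) -> float:
        """Average negative log-likelihood per transition of a test corpus."""
        total, n = 0.0, 0
        for word in words:
            padded = [BOUNDARY, *word, BOUNDARY]
            for prev, nxt in zip(padded, padded[1:]):
                total += self.log_prob(prev, nxt)
                n += 1
        if n == 0:
            raise ValueError("test corpus is empty")
        return -total / n


# Train once on the (toy) target corpus, then score any test corpus.
target_model = BigramModel(["pat", "bat", "tab"], alphabet="patb")
near = target_model.distance(["pat"])
far = target_model.distance(["ttt"])
```

In this toy example `near` is smaller than `far`, since the sequence "ttt" contains transitions never seen in the training words. Dividing by the number of transitions is a design choice made here so that corpora of different sizes remain comparable.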
Similar resources
Mutual intelligibility of Chinese dialects: Predicting cross-dialect word intelligibility from lexical and phonological similarity
This paper aims to predict mutual intelligibility (defined here as cross-dialectal word recognition) between 15 Chinese dialects from lexical and phonological distance measures. Distances were measured on the stimulus materials used in the experiment. Their predictive power was compared with earlier similar distance measures based on large word lists. Predictors based on just the stimulus mater...
Ranking severity of speech errors by their phonological impact in context
Children with speech disorders often present with systematic speech error patterns. In clinical assessments of speech disorders, evaluating the severity of the disorder is central. Current measures of severity have limited sensitivity to factors like the frequency of the target sounds in the child’s language and the degree of phonological diversity, which are factors that can be assumed to affe...
ASR, dialects, and acoustic/phonological distances
If the acoustic models in an ASR system have been built using standard pronunciations in the acoustic training database, dialect speakers usually show in a test a lower ASR performance compared to speakers of standard pronunciations. In this paper, this degree of degradation is considered to be a measure for the distance between dialect and standard pronunciation. We relate this ASR-distance wi...
Words cluster phonetically beyond phonotactic regularities.
Recent evidence suggests that cognitive pressures associated with language acquisition and use could affect the organization of the lexicon. On one hand, consistent with noisy channel models of language (e.g., Levy, 2008), the phonological distance between wordforms should be maximized to avoid perceptual confusability (a pressure for dispersion). On the other hand, a lexicon with high phonolog...
Evaluation Of String Distance Algorithms For Dialectology
We examine various string distance measures for suitability in modeling dialect distance, especially its perception. We find measures superior which do not normalize for word length, but which are sensitive to order. We likewise find evidence for the superiority of measures which incorporate a sensitivity to phonological context, realized in the form of n-grams, although we cannot identify ...